A Hierarchical Approach for Joint Multi-view Object Pose Estimation and Categorization


Mete Ozay, Krzysztof Walas and Ales Leonardis


Abstract —We propose a joint object pose estimation and 
categorization approach which extracts information about object
poses and categories from the object parts and compositions
constructed at different layers of a hierarchical object
representation algorithm, namely Learned Hierarchy of Parts
(LHOP) [7]. In the proposed approach, we first employ the
LHOP to learn hierarchical part libraries which represent
entity parts and compositions across different object categories
and views. Then, we extract statistical and geometric features
from the part realizations of the objects in the images in order
to represent the information about object pose and category
at each different layer of the hierarchy. Unlike the traditional
approaches which consider specific layers of the hierarchies
in order to extract information to perform specific tasks, we
combine the information extracted at different layers to solve a
joint object pose estimation and categorization problem using
distributed optimization algorithms. We examine the proposed
generative-discriminative learning approach and the algorithms
on two benchmark 2-D multi-view image datasets. The proposed
approach and the algorithms outperform state-of-the-art
classification, regression and feature extraction algorithms. In
addition, the experimental results shed light on the relationship
between object categorization, pose estimation and the part
realizations observed at different layers of the hierarchy.

I. Introduction 

The field of service robots aims to provide robots with 
functionalities which allow them to work in man-made 
environments. For instance, the robots should be able to 
categorize objects and estimate the pose of the objects to 
accomplish various robotics tasks, such as grasping objects 
[14]. Representation of object categories enables the robot 
to further refine the grasping strategy by giving context to 
the search for the pose of the object [15]. 

In this paper, we propose a joint object categorization and
pose estimation approach which extracts information about
the statistical and geometric properties of object poses and
categories from the object parts and compositions that
are constructed at different layers of the Learned Hierarchy
of Parts (LHOP) [7], [8], [9].

In the proposed approach, we first employ LHOP [7], [8]
to learn hierarchical part libraries which represent object
parts and compositions across different object categories
and views as shown in Fig. 1. Then, we extract statistical
and geometric features from the part realizations of the
objects in the images in order to represent the information
about the object pose and category at each different layer
of the hierarchy. We propose two novel feature extraction
algorithms, namely Histogram of Oriented Parts (HOP) and
Entropy of Part Graphs. HOP features measure local distributions
of global orientations of part realizations of objects
at different layers of a hierarchy. On the other hand, Entropy
of Part Graphs provides information about the statistical and
geometric structure of object representations by measuring
the entropy of the relative orientations of parts. In addition,
we compute a Histogram of Oriented Gradients (HOG) [5]
of part realizations in order to obtain information about the
co-occurrence of the gradients of part orientations.

Fig. 1: Combination of features extracted from part realizations
detected at different layers of LHOP.

Unlike traditional approaches which extract information
from the object representations at specific layers of the
hierarchy to accomplish specific tasks, we combine the
information extracted at different layers to solve a joint
object pose estimation and categorization problem using a
distributed optimization algorithm. For this purpose, we first
formulate the joint object pose estimation and categorization
problem as a sparse optimization problem called Group
Lasso [19]. We consider the pose estimation problem as
a sparse regression problem and the object categorization
problem as a multi-class logistic regression problem using
Group Lasso. Then, we solve the optimization problems
using a distributed and parallel optimization algorithm called
the Alternating Direction Method of Multipliers (ADMM) [1].

In this work, we extract information on object poses and
categories from 2-D images to handle the cases where 3-D
sensing may not be available or may be unreliable (e.g.
glass, metal objects). We examine the proposed approach
and the algorithms on two benchmark 2-D multiple-view
image datasets. The proposed approach and the algorithms
outperform state-of-the-art Support Vector Machine and
Regression algorithms. In addition, the experimental results
shed light on the relationship between object categorization,
pose estimation and the part realizations observed at different
layers of the hierarchy.

In the next section, related work is reviewed and the
novelty of our proposed approach is summarized. In Section
II, a brief presentation of the hierarchical compositional
representation is given. Feature extraction algorithms are
introduced in Section III. The joint object pose estimation
and categorization problem is defined, and two algorithms
are proposed to solve the optimization problem in Section
IV. Experimental analyses are given in Section V. Section
VI concludes the paper.

A. Related Work and Contribution 

In the field of computer vision, the problem of object
categorization and pose estimation has been studied thoroughly,
and some of the approaches are spreading to the robotics
community. With the advent of devices based on PrimeSense
sensors, uni-modal 3-D or multi-modal integration of 2-D
and 3-D data (e.g. RGB-D data) has been widely used by
robotics researchers [13]. However, 3-D sensing may not be
available or reliable due to limitations of object structures,
lighting resources and imaging conditions in many cases
where single or multiple view 2-D images are used for
categorization and pose estimation [3], [4], [20]. In [20],
a probabilistic approach is proposed to estimate the pose of
a known object using a single image. Collet et al. [3] build
3-D models of objects using SIFT features extracted from 2-D
images for robotic manipulation, and combine single image
and multiple image object recognition and pose estimation
algorithms in a framework in [4].

A promising approach to object categorization and
scene description is the use of hierarchical compositional
architectures [7], [9], [15]. Compositional hierarchical models
are constructed for object categorization and detection using
single images in [7], [9]. Multiple view images are used
for pose estimation and categorization using a hierarchical
architecture in [15]. In the aforementioned approaches, the
tasks are performed using either discriminative or generative
top-down or bottom-up learning approaches in architectures.
For instance, Lai et al. employ a top-down categorization
and pose estimation approach in [15], where a different
task is performed at each different layer of the hierarchy.
Note that a categorization error occurring at the top layer
of the hierarchy may propagate to the lower layers and affect
the performance of other tasks such as pose estimation in
this approach. In our proposed approach, we first construct
generative representations of object shapes using LHOP [7],
[8], [9]. Then, we train discriminative models by extracting
features from the object representations. In addition, we
propose a new method, which enables us to combine the
information extracted at each different layer of the hierarchy,
for joint categorization and pose estimation of objects. We
avoid the propagation of errors of performing multiple tasks
through the layers and enable the shareability of parts among
layers by the employment of optimization algorithms in each
layer in a parallel and distributed learning framework.

The novelty of the proposed approach and the paper can
be summarized as follows:

1) In this work, the Learned Hierarchy of Parts (LHOP)
is employed in order to learn a hierarchy of parts using
the shareability of parts across different views as well
as different categories [7], [8].

2) Two novel feature extraction algorithms, namely Histogram
of Oriented Parts (HOP) and Entropy of Part
Graphs, are proposed in order to obtain information
about the statistical and geometric structure of objects'
shapes represented at different layers of the hierarchy
using part realizations.

3) The proposed generative-discriminative approach enables
us to combine the information extracted at different
layers in order to solve a joint object pose estimation
and categorization problem using a distributed
and parallel optimization algorithm. Therefore, this
approach also enables us to share the parts among
different layers and avoid the propagation of object
categorization and pose estimation errors through the
layers.

II. Learned Hierarchy of Parts

In this section, the Learned Hierarchy of Parts (LHOP) [7], [8]
is briefly described. In LHOP, the object recognition process
is performed in a hierarchy starting from a feature layer
through more complex and abstract interpretations of object
shapes to an object layer. A learned vocabulary is a recursive
compositional representation of shape parts. Unsupervised
bottom-up statistical learning is employed in order to
obtain such a description.

Shape representations are built upon a set of compositional
parts which at the lowest layer use atomic features, e.g.
Gabor features, extracted from image data. The object node
is a composition of several child nodes located at one layer
lower in the hierarchy, and the composition rule is recursively
applied to each of its child nodes to the lowest layer $\Gamma_1$.
All layers together form a hierarchically encoded vocabulary
$\Gamma = \Gamma_1 \cup \Gamma_2 \cup \dots \cup \Gamma_L$. The entire vocabulary $\Gamma$ is learned
from the training set of images together with the vocabulary
parameters [8].

The parts in the hierarchy are defined recursively in the
following way. Each part in the $l$-th layer represents the
spatial relations between its constituent subparts from the
layer below. Each composite part $P_k^l$ constructed at the
$l$-th layer is characterized by a central subpart $P_{central}^{l-1}$ and a
list of remaining subparts with their positions relative to the
center as

$$P_k^l = \left( P_{central}^{l-1}, \{ (P_j^{l-1}, \mu_j, \Sigma_j) \}_j \right), \qquad (1)$$

where $\mu_j = (x_j, y_j)$ denotes the relative position of the
subpart $P_j^{l-1}$, while $\Sigma_j$ denotes the allowed variance of its
position around $(x_j, y_j)$.
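
The recursive definition above can be illustrated with a small data structure. This is only a sketch under stated assumptions and not the LHOP implementation: the class names `Part` and `Subpart`, the per-axis variance tuple, and the example part are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Subpart:
    """A non-central subpart and its placement relative to the central subpart."""
    part: "Part"
    mu: Tuple[float, float]      # relative position (x_j, y_j)
    sigma: Tuple[float, float]   # allowed variance around mu (per-axis form is an assumption)


@dataclass
class Part:
    """A compositional part; parts without a central subpart are atomic features."""
    layer: int
    central: Optional["Part"] = None
    subparts: List[Subpart] = field(default_factory=list)

    def atomic_count(self) -> int:
        """Count the lowest-layer features reached by applying the composition rule recursively."""
        if self.central is None:
            return 1
        return self.central.atomic_count() + sum(
            s.part.atomic_count() for s in self.subparts
        )


# Hypothetical example: a layer-2 part composed of three layer-1 features.
edge = Part(layer=1)
corner = Part(
    layer=2,
    central=edge,
    subparts=[
        Subpart(Part(layer=1), mu=(3.0, 0.0), sigma=(1.0, 1.0)),
        Subpart(Part(layer=1), mu=(0.0, 3.0), sigma=(1.0, 1.0)),
    ],
)
```

Storing the central subpart separately from the list of remaining subparts mirrors the description in the text, where only the non-central subparts carry a relative position and a variance.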

III. Feature Extraction from Learned Parts

LHOP provides information about different properties of
objects, such as poses, orientations and category memberships,
at different layers [7]. For instance, the information
on shape parts, which are represented by edge structures and
textural patterns observed in images, is obtained using Gabor
features at the first layer $L_1$. In the second and the following
layers, compositions of parts are constructed according to
the co-occurrence of part realizations that are detected in
the images among different views of the objects and across
different object categories. In other words, a library of object
parts and compositions is learned jointly for all object views
and categories.

In order to obtain information about statistical and geometric
properties of parts, we extract three types of features
from the part realizations detected at each different layer of
the LHOP.

A. Histogram of Orientations of Parts 

Histograms of orientations of parts are computed in order
to extract information on the co-occurrence of orientations of
the parts across different poses of objects. Part orientations
are computed according to a coordinate system of an image
$I$ whose origin is located at the center of the image $I$, and
the axes of the coordinate system are shown with blue lines
in Fig. 2.

If we define $p_k^l$, $\forall k = 1,2,\dots,K$, $\forall l = 1,2,\dots,L$, as the
realization of the $k$-th detected part in the $l$-th layer at an image
coordinate $(x_k, y_k)$ of $I$, then its orientation with respect to
the origin of the coordinate system is computed as

$$\theta_{k,l} = \arctan\left(\frac{y_k}{x_k}\right).$$

Then, the image $I$ is partitioned into $M$ cells $\{I_m\}_{m=1}^{M}$,
and histograms of the part orientations $\{\theta_{k,l}\}_{k=1}^{K}$ of the part
realizations $\{p_k^l\}_{k=1}^{K}$ that are located in each cell $I_m$ are
computed. The aggregated histogram values are considered
as variables of a $D_p$-dimensional feature vector $f_{hop}^l \in \mathbb{R}^{D_p}$.
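
A minimal sketch of this computation for one layer is given below. It is illustrative only and rests on several assumptions that are not stated in the text: part positions are given as pixel coordinates (column, row), `arctan2` is used instead of `arctan` so that the quadrant of each part is preserved, the cells form a regular grid, and the function names, grid shape and bin count are hypothetical.

```python
import numpy as np


def part_orientations(xy, image_shape):
    """Orientation (degrees in [0, 360)) of each part w.r.t. the image center.

    xy: (K, 2) array of pixel coordinates (column, row) of part realizations.
    """
    h, w = image_shape
    x = xy[:, 0] - w / 2.0          # shift the origin to the image center
    y = h / 2.0 - xy[:, 1]          # flip rows so that the y axis points up
    return np.degrees(np.arctan2(y, x)) % 360.0


def hop_features(xy, image_shape, cells=(2, 2), n_bins=8):
    """Concatenated per-cell histograms of part orientations for one layer."""
    h, w = image_shape
    rows, cols = cells
    theta = part_orientations(xy, image_shape)
    # Index of the grid cell that contains each part realization.
    r = np.minimum((xy[:, 1] * rows / h).astype(int), rows - 1)
    c = np.minimum((xy[:, 0] * cols / w).astype(int), cols - 1)
    cell = r * cols + c
    # Index of the orientation bin of each part realization.
    bins = np.minimum((theta / (360.0 / n_bins)).astype(int), n_bins - 1)
    feats = np.zeros((rows * cols, n_bins))
    np.add.at(feats, (cell, bins), 1.0)   # unbuffered accumulation of counts
    return feats.ravel()
```

With these assumed settings the feature dimension is the number of cells times the number of bins, here 4 x 8 = 32, which plays the role of $D_p$.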

B. Histogram of Oriented Gradients of Parts 

In addition to the computation of histograms of orientations
of part realizations $p_k^l$, $\forall k = 1,2,\dots,K$, $\forall l = 1,2,\dots,L$,
we compute the histogram of oriented gradients
(HOG) [5] of $p_k^l$ in order to extract information about the
distribution of gradient orientations of $p_k^l$, $\forall k, l$. We denote
the HOG feature vector extracted using $\{p_k^l\}_{k=1}^{K}$ in the $l$-th
layer as $f_{hog}^l \in \mathbb{R}^{D_h}$, where $D_h$ is the dimension of the HOG
feature vector. The details of the implementation of HOG
feature vectors are given in Section V.



Fig. 2: An image is partitioned into cells for the computation
of histograms of orientations of parts. A part realization $p_k^l$ is
depicted with a red point and associated to a part orientation
degree $\theta_{k,l}$.


C. The Entropy of Part Graphs 

We measure the statistical and structural properties of
relative orientations of part realizations by measuring the
complexity of a graph of parts. Mathematically speaking,
we define a weighted undirected graph $G_l := (E_l, V_l)$ in the
$l$-th layer, where $V_l := \{p_k^l\}_{k=1}^{K}$ is the set of part realizations
and $E_l := \{e_{k',k}\}_{k',k=1}^{K}$ is the set of edges, where each edge $e_{k',k}$
that connects the part realizations $p_{k'}^l$ and $p_k^l$ is associated
to an edge weight $w_{k',k}$, which is defined as

$$w_{k',k} := \arccos\left( \frac{pos_{k'} \cdot pos_k}{\|pos_{k'}\|_2 \, \|pos_k\|_2} \right),$$

where $pos_k := (x_k, y_k)$ is the position vector of $p_k^l$, $\|\cdot\|_2$
is the $\ell_2$ norm or Euclidean norm, and $pos_{k'} \cdot pos_k$ is the
inner product of $pos_{k'}$ and $pos_k$. In other words, the edge
weights are computed according to the orientations of parts
relative to each other.

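We measure the complexity of the weighted graph by computing
its graph entropy. First, we compute the normalized
weighted graph Laplacian $\mathcal{L}$ [6], [16] as

$$\mathcal{L} = \frac{1}{K(K-1)} (\mathcal{D} - \mathcal{W}),$$

where $\mathcal{W} \in \mathbb{R}^{K \times K}$ is a weighted adjacency matrix or a
matrix of weights $w_{k',k}$, and $\mathcal{D} \in \mathbb{R}^{K \times K}$ is a diagonal matrix
with members $\mathcal{D}_{k,k} := \sum_{k'=1}^{K} w_{k',k}$. Then, we compute the von
Neumann entropy of $G_l$ [6], [16] as

$$S(G_l) = -\mathrm{Tr}(\mathcal{L} \log_2 \mathcal{L}) \qquad (2)$$

$$\phantom{S(G_l)} = -\sum_{k=1}^{K} \nu_k \log_2 \nu_k, \qquad (3)$$

where $\nu_1 \geq \nu_2 \geq \dots \geq \nu_k \geq \dots \geq \nu_K = 0$ are the eigenvalues
of $\mathcal{L}$, $\mathrm{Tr}(\mathcal{L} \log_2 \mathcal{L})$ is the trace of the matrix product $\mathcal{L} \log_2 \mathcal{L}$
and $0 \log_2 0 = 0$. We use $S(G_l)$ as a feature variable $f_{ent}^l := S(G_l)$.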
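
The entropy feature can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function name is hypothetical, positions are assumed to be given relative to the image center with no part exactly at the origin, and at least two parts are assumed. Because the Laplacian is scaled by $1/(K(K-1))$ and its eigenvalues are not rescaled to sum to one, the returned value is not guaranteed to be nonnegative.

```python
import numpy as np


def part_graph_entropy(pos):
    """Von Neumann entropy of the part graph built from part positions.

    pos: (K, 2) array of position vectors (x_k, y_k), K >= 2, none equal to (0, 0).
    """
    pos = np.asarray(pos, dtype=float)
    K = pos.shape[0]
    norms = np.linalg.norm(pos, axis=1)
    # Edge weights: angle between the position vectors of each pair of parts.
    cos = (pos @ pos.T) / np.outer(norms, norms)
    W = np.arccos(np.clip(cos, -1.0, 1.0))
    np.fill_diagonal(W, 0.0)                 # no self-loops
    D = np.diag(W.sum(axis=0))               # weighted degree matrix
    lap = (D - W) / (K * (K - 1))            # scaled graph Laplacian
    nu = np.linalg.eigvalsh(lap)             # eigenvalues of the symmetric matrix
    nu = nu[nu > 1e-12]                      # apply the convention 0 * log2(0) = 0
    return float(-(nu * np.log2(nu)).sum())
```

For parts that all lie in the same direction from the origin, every edge weight is zero and the entropy is zero, which matches the intuition that such a configuration carries no orientation complexity.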






IV. Combination of Information Obtained at
Different Layers of LHOP for Joint Object Pose
Estimation and Categorization

In hierarchical compositional architectures, a different
object property, such as object shape, pose and category, is
represented at a different layer of a hierarchy in a vocabulary
[15]. According to the structures of the abstract representations
of the properties, i.e. vocabularies, recognition processes
have been performed using either a bottom-up [7], [8] or
top-down [15] approach. It is worth noting that the information
in the representations is distributed among the layers in
the vocabularies. In other words, the information about the
category of an object may reside at the lower layers of
the hierarchy instead of the top layer. In addition, lower
layer atomic features, e.g. oriented Gabor features, provide
information about part orientations which can be used for
the estimation of pose and view-points of objects at the
higher layers. Moreover, the relationship between the pose
and category of an object is bi-directional. Therefore, an
information integration approach should be considered in
order to avoid the propagation of errors that occur in multi-task
learning and recognition problems such as joint object
categorization and pose estimation, especially when only one
of the bottom-up and top-down approaches is implemented.

For this purpose, we propose a generative-discriminative
learning approach in order to combine the information obtained
at each different layer of LHOP using the features
extracted from part realizations. We represent the features
defining a $D_p + D_h + 1$ dimensional feature vector
$f^l = (f_{hop}^l, f_{hog}^l, f_{ent}^l)$. The feature vector $f^l$ is computed for each
training and test image; therefore, we denote the feature
vector of the image $I_i$ as $f_i^l$, $\forall i = 1,2,\dots,N$, in the
rest of the paper.

We combine the feature vectors $f_i^l$ extracted at each $l$-th
layer for object pose estimation and categorization under the
following Group Lasso optimization problem [19]

$$\text{minimize} \quad \|\mathcal{F}\omega - z\|_2^2 + \lambda \sum_{l=1}^{L} \|\omega_l\|_2, \qquad (4)$$

where $\|\cdot\|_2^2$ is the squared $\ell_2$ norm, $\lambda \in \mathbb{R}$ is a regularization
parameter, $\omega_l$ is the weight vector computed at the $l$-th layer,
$\mathcal{F}$ is a matrix of feature vectors $f_i^l$, $\forall i = 1,2,\dots,N$,
$\forall l = 1,2,\dots,L$, and $z = (z_1, z_2, \dots, z_N)$ is a vector of target
variables $z_i \in \mathbb{R}$, $\forall i = 1,2,\dots,N$. More specifically, $z_i \in \Omega$,
where $\Omega$ is a set of object poses, i.e. object orientation
degrees, in a pose estimation problem.

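We solve (4) using a distributed optimization algorithm
called the Alternating Direction Method of Multipliers (ADMM) [1]. For
this purpose, we first re-write (4) in the ADMM form as
follows

$$\begin{aligned} \text{minimize} \quad & \|\mathcal{F}\phi - z\|_2^2 + \lambda \sum_{l=1}^{L} \|\omega_l\|_2, \\ \text{subject to} \quad & \omega_l - \phi_l = 0, \quad l = 1,2,\dots,L, \end{aligned} \qquad (5)$$

where $\phi_l$ is the local estimate of the global variable $\phi$ for
$\omega_l$ at the $l$-th layer. Then, we solve (5) in the following three
steps [1], [18]:

1) At each layer $l$, we compute $\omega_l^{t+1}$ as

$$\omega_l^{t+1} := \arg\min_{\omega_l} \left( \rho \|\mu^t\|_2^2 + \lambda \|\omega_l\|_2 \right), \qquad (6)$$

where $\mu^t = \mathcal{F}_l(\omega_l - \omega_l^t) - \bar{\phi}^t + \alpha^t + \overline{\mathcal{F}\omega}^t$, $\rho > 0$
is a penalty parameter, $\overline{\mathcal{F}\omega}^t = \frac{1}{L} \sum_{l=1}^{L} \mathcal{F}_l \omega_l^t$, $\bar{\phi}^t$ is
the average of $\phi_l^t$, $\forall l = 1,\dots,L$, and $\alpha^t$ is a vector
of scaled dual optimization variables computed at an
iteration $t$.

2) Then, we update $\bar{\phi}$ as

$$\bar{\phi}^{t+1} := \frac{1}{L + \rho} \left( z + \rho \overline{\mathcal{F}\omega}^{t+1} + \rho \alpha^t \right). \qquad (7)$$

3) Finally, $\alpha$ is updated as

$$\alpha^{t+1} := \alpha^t + \overline{\mathcal{F}\omega}^{t+1} - \bar{\phi}^{t+1}. \qquad (8)$$

These three steps are iterated until a halting criterion, such
as $t \geq T$ for a given termination time $T$, is achieved.
Implementation details are given in the next section.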
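
To make the role of the three alternating updates concrete, the sketch below solves the Group Lasso objective $\|\mathcal{F}\omega - z\|_2^2 + \lambda \sum_l \|\omega_l\|_2$ with the standard two-block ADMM splitting $\omega = \phi$ described in [1]. It is a simplified, single-machine variant and not the layer-wise distributed form used in this section; the function names, the default `rho` and `n_iter`, and the representation of layers as index arrays are assumptions made for illustration.

```python
import numpy as np


def block_soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_2: shrink the whole group toward zero."""
    norm = np.linalg.norm(v)
    if norm <= kappa:
        return np.zeros_like(v)
    return (1.0 - kappa / norm) * v


def group_lasso_admm(F, z, groups, lam, rho=1.0, n_iter=200):
    """Minimize ||F w - z||_2^2 + lam * sum_l ||w_l||_2 by two-block ADMM.

    groups: list of index arrays, one per layer (assumed to partition the columns of F).
    """
    d = F.shape[1]
    w, phi, alpha = np.zeros(d), np.zeros(d), np.zeros(d)
    A = 2.0 * F.T @ F + rho * np.eye(d)     # system matrix of the quadratic w-update
    Ftz = 2.0 * F.T @ z
    for _ in range(n_iter):
        # 1) w-update: ridge-like least squares step.
        w = np.linalg.solve(A, Ftz + rho * (phi - alpha))
        # 2) phi-update: group-wise shrinkage (one group per layer).
        for idx in groups:
            phi[idx] = block_soft_threshold(w[idx] + alpha[idx], lam / rho)
        # 3) scaled dual update.
        alpha += w - phi
    return phi
```

Returning `phi` rather than `w` gives exactly zero weights for the groups that are switched off, which makes it easy to see which layers the regularizer has discarded.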

In a $C$-class object categorization problem, $z_i \in
\{1,2,\dots,c,\dots,C\}$ is a category variable. In order to solve
this problem, we employ 1-of-$C$ coding for sparse logistic
regression as

$$P(z_i^c = 1 \mid f_i) = \frac{\exp(h_c(f_i))}{1 + \exp(h_c(f_i))}, \qquad (9)$$

where $h_c(f_i) = f_i \cdot \omega^c$, $\omega^c$ is a weight vector associated to
the $c$-th category, and $z_i^c = 1$ if $z_i = c$, $\forall i = 1,2,\dots,N$. Then, we
define the following optimization problem

$$\text{minimize} \quad -\sum_{l=1}^{L} \sum_{i=1}^{N} loss_l(i) + \lambda \|\omega_l^c\|_1, \qquad (10)$$

where $loss_l(i) = z_i^c h_c(f_i) - \log\left( \exp(h_c(f_i)) + 1 \right)$. In order
to solve (10), we employ the three update steps given above
with two modifications. First, we solve (6) for the $\ell_1$ norm
in the last regularization term $\lambda \|\omega_l\|_1$ instead of the $\ell_2$ norm.
Second, we employ the logistic regression loss function in
the computation of $\bar{\phi}$ as

$$\bar{\phi}^{t+1} := \arg\min_{\bar{\phi}} \left( \rho \|\bar{\phi} - \overline{\mathcal{F}\omega}^{t+1} - \alpha^t\|_2^2 + \log\left( 1 + \exp(-L\bar{\phi}) \right) \right). \qquad (11)$$


In the training phase of the pose estimation algorithm,
we compute the solution vector $\omega = (\omega_1, \omega_2, \dots, \omega_L)$ using
training data. In the test phase, we employ the solution vector
$\omega$ on a given test feature vector $f_i$ of the part realizations of
an object to estimate its pose as

$$\hat{z}_i = f_i \cdot \omega.$$

In the categorization problem, we predict the category
label $\hat{z}_i$ of an object in the $i$-th image as

$$\hat{z}_i = \arg\max_{c} z_i^c.$$
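
The two test-time rules can be sketched in a few lines. The function names are hypothetical, and scoring each category with the logistic probability of the 1-of-C model before taking the maximum is an assumption about how the category scores are obtained.

```python
import numpy as np


def predict_pose(f, omega):
    """Pose estimate as the inner product of a test feature vector and the learned weights."""
    return float(np.dot(f, omega))


def category_probabilities(f, omegas):
    """Per-category logistic scores exp(h_c) / (1 + exp(h_c)) with h_c = f . omega^c.

    omegas: (C, D) array holding one weight vector per category.
    """
    h = omegas @ f
    return 1.0 / (1.0 + np.exp(-h))


def predict_category(f, omegas):
    """Category label in {1, ..., C} with the largest score."""
    return int(np.argmax(category_probabilities(f, omegas))) + 1
```

Since the logistic function is monotonic, taking the maximum over the probabilities selects the same category as taking the maximum over the linear scores $h_c$.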









V. Experiments 

We examine our proposed approach and algorithms on
two benchmark object categorization and pose estimation
datasets, namely the Amsterdam Library of Object
Images (ALOI) [10] and the Columbia Object Image Library
(COIL-100) [17]. We have chosen these two benchmark
datasets for two main reasons. First, images of objects are
captured by rotating the objects on a turntable by regular
orientation degrees, which enables us to analyze our proposed
algorithm for multi-view object pose estimation and
categorization in uncluttered scenes. Second, object poses and
categories are labeled within acceptable precision, which is
important to satisfy the statistical stability of training and
test samples and their target values. In our experiments, we
also re-calibrated labels of pose and rotation values of the
objects that are mis-recorded in the datasets.

We select the bin size ($bSize$) of the histograms and the
cell size $M$ of HOP (see Section III-A) and HOG features
(see Section III-B) by greedy search on the parameter
set $\{8, 16, 32, 64\}$, and take the $bSize$ and $M$
which minimize the pose estimation and categorization errors
on the training datasets in the pose estimation and
categorization problems, respectively. In the employment of optimization
algorithms, we compute $\lambda = a\lambda_{max}$, where $\lambda_{max} = \|\mathcal{F}\omega\|_\infty$,
$\omega = (\omega_1, \dots, \omega_L)$, $\|\cdot\|_\infty$ is the $\ell_\infty$ norm, and the parameter $a$
is selected from the set {10~®, 10“^,..., 10^} using greedy 
search by minimizing training error of object pose estimation 
and categorization as suggested in [1]. In the implementation 
of LHOP, we learn the compositional hierarchy of parts and 
compute the part realizations for L = 1,2,3,4 [7]. 

In the experiments, pose estimation and categorization 
performances of the proposed algorithms are compared with 
state-of-the-art Support Vector Regression (SVR), Support 
Vector Machines (SVM) [2], Lasso and Logistic regression 
algorithms [12] which use the state-of-the-art HOG features 
[5] extracted from the images as considered in [11]. In 
the results, we refer to an implementation of SVM with 
HOG features as SVM-HOG, SVM with the proposed LHOP 
features as SVM-LHOP, SVR with HOG features as SVR- 
HOG, SVR with the proposed LHOP features as SVR-LHOP, 
Lasso with HOG features as L-HOG, Logistic Regression 
with HOG features as LR-HOG, Lasso with LHOP features 
as L-LHOP, and Logistic Regression with LHOP features as
LR-LHOP.

We use RBF kernels in SVR and SVM. The kernel width
parameter $\sigma$ is searched in the interval $\log(\sigma) \in [-10, 5]$
and the SVR cost penalization parameter $\epsilon$ is searched in
the interval $\log(\epsilon) \in [-10, 5]$ using the training datasets.

A. Experiments on Object Pose Estimation 

We have conducted two types of experiments for object
pose estimation, namely Object-wise and Category-wise Pose
Estimation. We analyze the shareability of the parts across
different views of an object in Object-wise Pose Estimation
experiments. In Category-wise Pose Estimation experiments,
we analyze how incorporating category information affects the
shareability of parts in the LHOP and the pose estimation performance.


1) Experiments on Object-wise Pose Estimation: In the
first set of experiments, we consider the objects belonging to
each different category, individually. For instance, we select
$N_{tr}^{ob} = 4$ objects for training and $N_{te}^{ob} = 1$ object for testing
using objects belonging to the Cups category. The ID numbers
of the objects and their category names are given in Table I.
For each object, we have 72 object instances each of which
represents an orientation of the object $z_i = \theta_i$ on a turntable
rotated with $\theta_i \in \Omega$ and $\Omega = \{0^\circ, 5^\circ, 10^\circ, \dots, 355^\circ\}$.

In the experiments, we first analyze the variation of part
realizations and feature vectors across different orientations
of an object. We visualize the features $f_{hop}^l$, $f_{hog}^l$ and
$f_{ent}^l$ in Fig. 3 for a cup which is oriented with $\theta \in
\{20^\circ, 60^\circ, 120^\circ, 180^\circ, 240^\circ, 280^\circ, 340^\circ\}$ and for each $l =
1,2,3,4$. In the first row at the top of the figure, the change of
$f_{ent}^l$ is visualized $\forall l$. In the second row, the original images
of the objects are given. In the third to the sixth rows, $f_{hop}^l$
are visualized by displaying the part realizations with pixel
intensity values $\|f_{hop}^l\|_2^2$ for each $l = 1,2,3,4$. $f_{hog}^l$ features
are visualized in the rest of the rows for each $l$.



Fig. 3: Visualization of features extracted from part realizations
for each different orientation of a cup and at each
different layer of LHOP. (Axis label: object pose degree $\theta$.)


In Fig. 3, we first observe that the $f_{ent}^l$ values of the object
change discriminatively across different object orientations
$\theta$. For instance, if the handle of the cup is not seen from
the front viewpoint of the cup (e.g. at $\theta = 60^\circ, 120^\circ$), then we
observe a smooth surface of the cup and the complexity of
the part graphs, i.e. the entropy values, decreases. On the other
hand, if the handle of the cup is observed at a front viewpoint
(e.g. at $\theta = 240^\circ, 280^\circ$), then the complexity increases. In
addition, we observe that the difference between the $f_{ent}^l$ values
of the object parts across different orientations $\theta$ decreases
as $l$ increases. In other words, the discriminative power of
the generative model of the LHOP increases at the higher
layers of the LHOP since the LHOP captures the important
parts and compositions that co-occur across different
views through different layers.

TABLE I: The samples that are selected from ALOI dataset and used in Object-wise Pose Estimation Experiments

Category Name    Object IDs for Training    Object IDs for Testing
Apples           82                         363, 540, 649, 710
Balls            103                        164, 266, 291, 585
Bottles          762                        798, 829, 831, 965
Boxes            13                         110, 26, 46, 78
Cars             54                         136, 138, 148, 158
Caps             157                        36, 125, 153, 259
Shoes            9                          93, 113, 350, 826



Fig. 4: Comparison of Object-wise Pose estimation errors ($\epsilon$)
of the proposed algorithms. (Legend: SVR-HOG, SVR-LHOP;
categories: Apples, Balls, Bottles, Boxes, Cars, Mugs, Shoes.)

Given a ground truth $\theta$ and an estimated pose value
$\hat{\theta}$, the pose estimation error is defined as $\epsilon = \|\theta - \hat{\theta}\|_1$.
Pose estimation errors of state-of-the-art algorithms and the
proposed Hierarchical Compositional Approach are given in
Fig. 4. In these results, we observe that the pose estimation
errors of the algorithms obtained on
symmetric objects, such as apples and balls, are greater
than those obtained on more
structural objects such as cups.
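
A short sketch of the error computation for angles given in degrees follows. The first function takes the plain absolute difference between the ground truth and the estimate. The second, a wrap-around variant, is not part of the paper and is included only as an assumption-labeled alternative that treats 0° and 360° as the same pose; both function names are hypothetical.

```python
import numpy as np


def pose_error(theta_true, theta_pred):
    """Absolute difference (degrees) between ground-truth and estimated poses."""
    return np.abs(np.asarray(theta_true, dtype=float) - np.asarray(theta_pred, dtype=float))


def circular_pose_error(theta_true, theta_pred):
    """Wrap-around variant (an assumption, not from the paper): shortest angular distance."""
    d = pose_error(theta_true, theta_pred) % 360.0
    return np.minimum(d, 360.0 - d)
```

For a ground truth of 10° and an estimate of 225°, the plain error is 215° while the wrap-around distance is 145°, so the choice of definition changes how mistakes on symmetric objects are scored.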

In order to analyze this observation in detail, we show the
ground truth $\theta$ and the estimated orientations $\hat{\theta}$ of some of
the objects from Apples, Balls, Cups and Shoes categories in
Fig. 5. We observe that some of the different views of the
same object have the same shape and textural properties. For
instance, the views of the ball at the orientations $\theta = 10^\circ$
and $\theta = 225^\circ$ represent the same pentagonal shape patterns.
Therefore, similar parts are detected at these different views
and similar features are extracted from these detected
parts. As a result, the orientation of the ball, which is rotated by
$\theta = 10^\circ$, is incorrectly estimated as $\hat{\theta} = 225^\circ$.

Fig. 5: Results for some of the objects from Apples, Balls,
Cups and Shoes categories obtained in Object-wise Pose
estimation experiments.

2) Experiments on Category-wise Pose Estimation: In
Category-wise Pose Estimation experiments, we randomly select
different numbers of objects from $C$ different
categories as training images to estimate the pose of test
objects. We perform the experiments on both ALOI
and COIL datasets.

In the ALOI dataset, we randomly select $N_{tr}^{ob} = 1,2,3,4$
training objects and $N_{te}^{ob} = 1$ test object which
belong to Cups, Cow, Car, Clock and Duck categories. We
repeat the random selection process two times and give the
average pose estimation error for each experiment. In order
to analyze the contribution of the information that can be
obtained from the parts to the pose estimation performance
using the part shareability of the LHOP, we initially select
Cups and Cow categories ($C = 2$) and add new categories
(Car, Clock and Duck) to the dataset, incrementally. The
results are given in Table II. The results show that the
pose estimation error decreases as the number of training
samples, $N_{tr}^{ob}$, increases. This is due to the fact that the
addition of new objects to the dataset increases the statistical
representation capacity of the LHOP and the learning model
of the regression algorithm. In addition, we observe that the
pose estimation error observed in the experiments for $C = 2$
decreases when the objects from Car category are added to a
dataset of objects belonging to Cups and Cow categories in the
experiments with $C = 3$. The performance boost is achieved
by increasing the shareability of co-occurring object parts in
different categories. For instance, the parts that construct the
rectangular silhouettes of cows and cars can be shared in
the construction of object representations in the LHOP (see
Fig. 6).



Fig. 6: Sample images of the objects that are used in 
Category-wise Pose Estimation experiments. 


We performed two types of experiments on the COIL dataset,
constructing balanced and unbalanced training and test sets,
in order to analyze the effect of unbalanced data on the
pose estimation performance. In the experiments, the objects
are selected from Cat, Spatula, Cups and Car categories
which contain 3, 3, 10 and 10 objects, respectively. Each object is rotated
on a turntable by $5^\circ$ from $0^\circ$ to $355^\circ$.

In the experiments on balanced datasets, images of $N_{tr}^{ob}$
objects are initially selected from Cat and Spatula
categories (for $C = 2$), and then images of the objects selected
from Cups and Car categories are incrementally added
to the dataset for $C = 3$ and $C = 4$ category experiments.
More specifically, $N_{tr}^{ob}$ objects are randomly selected from
each category and the random selection is repeated two times
for each experiment. The results are shown in Table III.
We observe that the addition of new objects to the datasets
decreases the pose estimation error. Moreover, we observe
a remarkable performance boost when the images of the
objects from the categories that have similar silhouettes, such
as Cat and Cups or Spatula and Car, are used in the same
dataset.


TABLE III: Category-wise Pose estimation errors ($\epsilon$)
of SVR-HOG/SVR-LHOP/L-HOG/L-LHOP/Proposed Approach
for different number of categories ($C$) and training
samples ($N_{tr}^{ob}$) selected from COIL dataset.

N_tr    C=2                   C=3                  C=4
1       125/109/120/95/85     120/85/103/77/68     110/79/95/71/62
2       120/95/114/89/77      93/77/81/63/59       104/76/92/69/51


We prepared unbalanced datasets by randomly selecting
the images of $N_{te}^{ob} = 1$ object from each category as a test
sample and the images of the rest of the objects belonging
to the associated category in the COIL dataset as training
samples. For instance, the images of a randomly selected cat
are selected as test samples and the images of the remaining
two cats are selected as training samples. This procedure
is repeated two times in each experiment and the average
values of pose estimation errors are depicted in Fig. 7. The
results show that SVR is more sensitive to the balance of
the dataset and the number of training samples than the
proposed approach. For instance, the difference between the
pose estimation errors of SVR given in Table III and Fig. 7
for $C = 4$ is approximately $10^\circ$, while that of the proposed
Hierarchical Compositional Approach is approximately $5^\circ$.

Fig. 7: Category-wise Pose estimation errors ($\epsilon$) of the
state-of-the-art algorithms and the proposed Hierarchical
Compositional Approach in the experiments on COIL dataset.


In the next subsection, the experiments on object
categorization are given.


B. Experiments on Object Categorization 

In the Object Categorization experiments, we use the same 
experimental settings that are described in Section V-A.2 for 
Category-wise Pose Estimation. 


TABLE V: Categorization performance (%) of
SVM-HOG/SVM-LHOP/LR-HOG/LR-LHOP/Proposed Approach
using COIL dataset.

Training samples   C=2               C=3              C=4
1                  94/93/92/95/100   89/88/91/91/97   81/79/80/81/84
2                  97/97/96/97/100   89/91/90/93/97   84/86/83/87/90


The results of the experiments on the ALOI dataset
and on balanced subsets of the COIL dataset are given in
Table IV and Table V, respectively. In these experiments, we
observe that the categorization performance decreases as the
number of categories increases. In contrast, we observed in
the previous sections that the pose estimation error decreases
as the number of categories increases. The reason for this
difference is that objects rotated on a turntable may produce
similar silhouettes even though they belong to different
categories. Therefore, adding the images of new objects that
belong to different categories may boost pose estimation
performance. On the other hand, adding the images of these
new objects may decrease the categorization performance if
the parts of the objects cannot be shared across different
categories and therefore increase the data complexity of the
feature space.


TABLE II: Category-wise Pose estimation errors (e) of SVR-HOG/SVR-LHOP/L-HOG/L-LHOP/Proposed Approach for
different number of categories (C) and training samples selected from ALOI dataset.

Training samples   C=2                 C=3                C=4                C=5
1                  133/103/140/97/91   116/99/110/97/89   110/95/102/95/88   102/94/99/95/88
2                  130/100/133/95/85   108/93/104/88/81   105/91/95/88/80    100/94/100/91/85
3                  105/91/104/86/75    93/83/87/83/70     99/86/94/84/75     95/81/93/75/70
4                  94/86/90/73/68      90/79/84/73/65     92/77/86/72/64     95/75/88/71/60


TABLE IV: Categorization performance (%) of SVM-HOG/SVM-LHOP/LR-HOG/LR-LHOP/Proposed Approach for
different number of categories (C) and training samples selected from ALOI dataset.

Training samples   C=2               C=3               C=4              C=5
1                  88/89/91/93/100   85/88/84/92/98    85/85/84/85/90   81/81/81/83/90
2                  88/91/92/94/100   88/91/87/93/98    87/87/86/88/92   81/83/81/84/91
3                  95/98/94/98/100   91/93/91/95/99    90/90/90/91/93   83/85/83/88/91
4                  97/98/98/99/100   93/96/93/97/100   90/91/90/91/94   87/91/89/95/96
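
The opposite trends discussed in this subsection can be read off Tables II and IV directly. The sketch below is only a reading aid: it copies the table cells for the ALOI dataset, extracts the last entry of each cell (the proposed approach) and reports the net change between C=2 and C=5. The names `TABLE_II`, `TABLE_IV`, `proposed_series` and `net_change` are introduced here for illustration.

```python
PROPOSED = 4  # position of the proposed approach inside each "a/b/c/d/e" cell

# Rows are indexed by the number of training samples; columns are C=2..5.
TABLE_II = {  # pose estimation errors, ALOI dataset
    1: ["133/103/140/97/91", "116/99/110/97/89", "110/95/102/95/88", "102/94/99/95/88"],
    2: ["130/100/133/95/85", "108/93/104/88/81", "105/91/95/88/80", "100/94/100/91/85"],
    3: ["105/91/104/86/75", "93/83/87/83/70", "99/86/94/84/75", "95/81/93/75/70"],
    4: ["94/86/90/73/68", "90/79/84/73/65", "92/77/86/72/64", "95/75/88/71/60"],
}
TABLE_IV = {  # categorization performance (%), ALOI dataset
    1: ["88/89/91/93/100", "85/88/84/92/98", "85/85/84/85/90", "81/81/81/83/90"],
    2: ["88/91/92/94/100", "88/91/87/93/98", "87/87/86/88/92", "81/83/81/84/91"],
    3: ["95/98/94/98/100", "91/93/91/95/99", "90/90/90/91/93", "83/85/83/88/91"],
    4: ["97/98/98/99/100", "93/96/93/97/100", "90/91/90/91/94", "87/91/89/95/96"],
}


def proposed_series(row):
    """Return the proposed-approach value of every cell in a table row."""
    return [int(cell.split("/")[PROPOSED]) for cell in row]


def net_change(series):
    """Difference between the last (C=5) and first (C=2) value."""
    return series[-1] - series[0]


for n_train in sorted(TABLE_II):
    pose = proposed_series(TABLE_II[n_train])
    acc = proposed_series(TABLE_IV[n_train])
    print(f"samples={n_train}: pose error {pose} (net {net_change(pose)}), "
          f"accuracy {acc} (net {net_change(acc)})")
```

For the proposed approach, the net change from C=2 to C=5 is never positive for the pose estimation error and is negative for the categorization performance in every row, although neither trend is strictly monotone across intermediate values of C.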

VI. Conclusion 

In this paper, we have proposed a compositional hierarchical
approach for joint object pose estimation and categorization
using a generative-discriminative learning method.
The proposed approach first exposes information about the pose
and category of an object by extracting features from its
realizations observed at different layers of LHOP in order
to consider the different levels of abstraction of the
information represented in the hierarchy. Next, the joint object
pose estimation and categorization problem is formulated as a
sparse optimization problem. Then, the optimization problem
is solved by integrating the features extracted at each layer
using a distributed and parallel optimization algorithm.

We examine the proposed approach on benchmark 2-D
multi-view image datasets. In the experiments, the proposed
approach outperforms state-of-the-art Support Vector Machines
for object categorization and Support Vector Regression
for object pose estimation. In addition, we observe
that shareability of object parts across different object
categories and views may increase pose estimation performance.
On the other hand, object categorization performance
may decrease as the number of categories increases if the parts
of an object cannot be shared across different categories
and therefore increase the data complexity of the feature space.
The proposed approach can successfully estimate the pose of
objects which have view-specific statistical and geometric
properties. However, the proposed feature extraction
algorithms cannot provide information about the view-specific
properties of symmetric or semi-symmetric objects,
which leads to a decrease in object pose estimation and
categorization performance. Therefore, ongoing work is
directed towards alleviating the problems with symmetric or
semi-symmetric objects.